printf: improve support of printing multi-byte values of characters #7048

sylvestre · 2025-01-01T16:27:40Z

No description provided.

github-actions · 2025-01-01T17:46:55Z

GNU testsuite comparison:

Congrats! The gnu test tests/printf/printf-mb is no longer failing!

github-actions · 2025-01-01T18:16:32Z

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)
Congrats! The gnu test tests/printf/printf-mb is no longer failing!

github-actions · 2025-01-01T19:02:29Z

GNU testsuite comparison:

Skip an intermittent issue tests/tail/inotify-dir-recreate (fails in this run but passes in the 'main' branch)
Congrats! The gnu test tests/printf/printf-mb is no longer failing!

sylvestre · 2025-01-02T10:17:49Z

@jtracey, given your work on #7020, I am wondering if you would be able to help with the Windows support? :) Thanks

jtracey · 2025-01-02T17:13:27Z

src/uu/printf/src/printf.rs

+    #[cfg(windows)]
+    let format_vec: Vec<u8> = format
+        .encode_wide()
+        .flat_map(|wchar| wchar.to_le_bytes())


This won't do what we want: it effectively casts UTF-16 into a byte array, and UTF-16 is not byte-compatible with UTF-8 (e.g., each ASCII character is a single literal byte in UTF-8 but two bytes in UTF-16). Because invalid Unicode is much rarer on Windows, the way for us to turn OsStr(ing)s into bytes is os_str_as_bytes, which errors on invalid Unicode on Windows, or os_str_as_bytes_lossy, which turns invalid Unicode sequences into the replacement character on Windows. Windows makes it difficult to pass invalid Unicode as an argument, so whichever feels more appropriate should be fine here.
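To make the byte-compatibility point concrete, here is a minimal sketch. It uses `str::encode_utf16` as a portable stand-in for the Windows-only `encode_wide`, and the function names are hypothetical, not taken from the PR.

```rust
/// Mimics the approach in the diff: dump every UTF-16 code unit as two
/// little-endian bytes. (`encode_utf16` stands in for `encode_wide` here.)
fn utf16_le_bytes(s: &str) -> Vec<u8> {
    s.encode_utf16()
        .flat_map(|unit| unit.to_le_bytes())
        .collect()
}

/// The UTF-8 bytes of the same text, which is what a byte-oriented
/// format parser would expect to see.
fn utf8_bytes(s: &str) -> Vec<u8> {
    s.as_bytes().to_vec()
}
```

For the format string `%s`, the first function yields `25 00 73 00` while the second yields `25 73`, so a parser scanning for the single byte `%` followed by a conversion character would see an interleaved NUL instead.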

Also, a nit that won't be relevant after the change: cfgs that assume the platform is either unix or windows are a code smell, since there are platforms that are neither (none that we support yet, but I hope to change that at some point 😛).
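One way to avoid the unix-or-windows assumption is to pair a specific `cfg` with its negation, so every target gets exactly one definition. The sketch below is only an illustration with a hypothetical helper name (`arg_bytes`); it is not how uucore implements `os_str_as_bytes`.

```rust
use std::ffi::OsStr;

/// On Unix an `OsStr` is already arbitrary bytes, so copy them out.
#[cfg(unix)]
fn arg_bytes(s: &OsStr) -> Option<Vec<u8>> {
    use std::os::unix::ffi::OsStrExt;
    Some(s.as_bytes().to_vec())
}

/// Everywhere else (Windows, but also any other target), only accept
/// arguments that are valid Unicode and return their UTF-8 bytes.
#[cfg(not(unix))]
fn arg_bytes(s: &OsStr) -> Option<Vec<u8>> {
    s.to_str().map(|valid| valid.as_bytes().to_vec())
}
```

Because the second definition is gated on `not(unix)` rather than `windows`, the code still compiles on a target that is neither.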

jtracey · 2025-01-02T20:37:09Z

src/uu/printf/src/printf.rs

+                    .collect();
+                FormatArgument::Unparsed(
+                    String::from_utf8(raw_bytes.clone())
+                        .unwrap_or_else(|_| raw_bytes.iter().map(|&b| b as char).collect()),


I can't find it documented anywhere, but collecting b as char has unpredictable behavior. Casting a u8 to char casts to the char with the code point with that value, not the UTF-8 parsing of that value. This was made invisible because somewhere, Rust seems to be doing some kind of dynamic dispatch to instead collect the bytes into a UTF-8 string if the whole iterator is valid UTF-8. Here's a playground if you want a more detailed demonstration.
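The cast behavior described here can be reproduced with a small sketch (function names are hypothetical). In this reduced form, the `unwrap_or_else` closure runs only when `String::from_utf8` fails, which is one plausible reason the cast goes unnoticed for valid UTF-8 input.

```rust
/// The fallback from the diff in isolation: each byte becomes the char
/// whose code point equals the byte value (U+0000 through U+00FF).
fn bytes_as_chars(raw: &[u8]) -> String {
    raw.iter().map(|&b| b as char).collect()
}

/// Mirrors the whole expression under review: try UTF-8 first and
/// fall back to the per-byte cast only if decoding fails.
fn from_utf8_or_cast(raw: Vec<u8>) -> String {
    String::from_utf8(raw.clone())
        .unwrap_or_else(|_| raw.iter().map(|&b| b as char).collect())
}
```

The bytes `C3 A9` are the UTF-8 encoding of `é`. Cast byte by byte they become `Ã©`, which re-encodes to four bytes, whereas the combined expression decodes them to `é` and never reaches the cast. A lone `FF` byte is not valid UTF-8, so it takes the fallback and comes out as `ÿ` (U+00FF), encoded as `C3 BF`.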

Ideally, the FormatArgument::Unparsed value should contain bytes or an OsString instead of a String. Assuming that's outside the scope of this PR, it's better to specify what should happen when a non-UTF-8 string is passed in, using to_str or to_string_lossy, and to avoid encoding values as bytes entirely for now (this also avoids allocation overhead, since char is the size of a u32).
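A minimal sketch of the two explicit policies, assuming the strict option corresponds to the standard library's `OsStr::to_str` and the lossy one to `OsStr::to_string_lossy`. The helper names are hypothetical, and the non-UTF-8 sample is only constructible this way on Unix.

```rust
use std::ffi::OsStr;

/// Strict policy: `None` if the argument is not valid Unicode,
/// leaving the caller to report an error.
fn strict_text(arg: &OsStr) -> Option<&str> {
    arg.to_str()
}

/// Lossy policy: invalid sequences become U+FFFD (the replacement character).
fn lossy_text(arg: &OsStr) -> String {
    arg.to_string_lossy().into_owned()
}

/// Unix-only sample argument containing a byte that is never valid UTF-8.
#[cfg(unix)]
fn non_utf8_sample() -> std::ffi::OsString {
    use std::os::unix::ffi::OsStringExt;
    std::ffi::OsString::from_vec(vec![b'a', 0xFF, b'b'])
}
```

Either choice makes the handling of a non-UTF-8 argument explicit: the strict version surfaces it as `None`, and the lossy version substitutes a visible replacement character.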

github-actions · 2025-01-06T08:36:22Z

GNU testsuite comparison:

Skip an intermittent issue tests/tail/inotify-dir-recreate (fails in this run but passes in the 'main' branch)
Congrats! The gnu test tests/printf/printf-mb is no longer failing!

sylvestre force-pushed the printf-go branch 7 times, most recently from 95a1333 to 80988c3 · January 1, 2025 17:46

sylvestre force-pushed the printf-go branch from 80988c3 to a79d0d8 · January 1, 2025 17:50

sylvestre force-pushed the printf-go branch 3 times, most recently from 0e9e153 to 63eb50f · January 1, 2025 18:35

jtracey reviewed Jan 2, 2025

sylvestre added 5 commits January 6, 2025 09:08

printf: remove old clippy ignore

ea485bc

printf: Improve support for printing multi-byte values of characters

650eef9

printf: simplify and dedup some tests

7c589f3

printf: support for invalid utf-8 page

b8e8ef6

printf: support for extract chars

60f9570

Should fix tests/printf/printf-mb.sh

sylvestre force-pushed the printf-go branch from 63eb50f to 60f9570 · January 6, 2025 08:08
